1. INTRODUCTION

Sentiment analysis (SA) is the process of computationally identifying and categorizing opinions expressed
in text. It is used to discover a writer's opinion towards a topic, product or issue.
Sentiment analysis can be regarded as an Opinion Mining (OM) process that uses Natural Language
Processing (NLP) to extract and identify information within selected words [1]. Developing sentiment
analysis for a specific language benefits its native speakers in several fields, such as business,
politics and entertainment. In this research, we develop a sentiment analysis tool that focuses on
Malay language texts. We aim to identify the opinions of internet users, specifically on Twitter,
by classifying "tweets" according to their polarity. Generally, polarity is divided into three
classes: positive, negative and neutral [2]. The polarity of opinions can also be
divided into two groups: prior polarity and contextual polarity. Prior polarity is the polarity that a word
carries in the lexicon, while contextual polarity is the polarity of the expression in which the word appears [2].
Various sources can supply the data required for a sentiment analysis study.
Among the most popular are Social Network Sites (SNS) such as Twitter and Facebook,
together with blogs and online forums. What these sources have in common is that they contain comments and
opinions posted by people, with emotions and opinions expressed through text and emoticons.
Besides the polarity of words, another interesting dimension is the strength, or intensity,
of that polarity. One should therefore be able to identify the intensity level of a word,
with different numerical scores assigned to words to indicate the intensity or strength
of the opinion expressed in a sentence [3].

The majority of work on polarity classification and strength has been conducted for the
English language [2-8], and work on such classification for the Malay language is still scarce.
To address this issue, we propose the Malay Polarity Classification Tool (MaCT), which
determines the polarity and intensity of opinions from our data source, Twitter. This paper presents the
development of MaCT. The tool is built on the AFINN lexicon, which is widely used for English;
the polarity of words is identified with the lexicon, and the accuracy is analyzed using our
classification method. Our method successfully grouped the sentiment polarity of 88.06% of the tweets.
For the classified tweets, it achieved 90% precision, recall and accuracy.


Background and Related Work 

In general, there are three approaches to sentiment classification: machine learning,
lexicon-based and hybrid. The Machine Learning (ML) approach relies on ML algorithms
applied to linguistic features; the Lexicon-based approach relies on a sentiment lexicon,
a collection of known and pre-compiled sentiment terms; and the Hybrid approach combines
both [1]. A number of works exist in sentiment analysis, but the majority of them target
the English language. For example, Bravo-Marquez et al. suggest an approach to boost Twitter
sentiment classification using different sentiment dimensions, such as opinion strength, emotion and
polarity indicators, as meta-level features [3]. They found that combining sentiment dimensions
provides a significant improvement in Twitter sentiment classification tasks such as polarity and subjectivity.

In another work, Agrawal et al. built a new method for sentiment classification. SentiWordNet
is first used to assign scores to a sentence, and heuristics are then used to handle context-dependent
sentiment expressions. They report that their method shows a significant improvement over the
baseline on a movie review dataset [4]. In a similar line, Nasukawa et al. present a sentiment analysis
approach that extracts sentiments associated with positive (favourable) or negative (unfavourable)
polarity toward a specific subject. They asserted that, to improve the accuracy of sentiment analysis,
it is important to properly identify the semantic relationships between the sentiment expressions and
the subject [9]. For sentiment analysis in the Malay language, Handayani et al.
conducted a Systematic Literature Review (SLR) of publications on SA, focusing on
work done for Malay [10]. They reported that four out of ten papers on Malay SA use the
Lexicon-based approach, three use the Machine Learning approach, two apply the Rule-based
approach and one uses the Hybrid approach [10]. Furthermore, [11] tried to improve classification
performance by proposing a Malay SA classification model based on semantic orientation and machine
learning approaches. They collected 2,478 Malay sentiment-lexicon phrases and words, assigned synonyms to
each word, and then manually assigned a polarity score. Al-Moslmi et al. examined the effects of
commonly used feature selection methods (Information Gain, Gini Index and Chi-squared)
and three machine learning classifiers (SVM, Naive Bayes and K-nearest neighbor) on Malay
sentiment classification. They conducted a series of experiments using a Malay Opinion Corpus and found that
feature selection techniques improve the performance of Malay sentiment-based classification [12].

The remainder of this paper is organized as follows: Section 2 presents the methodology used in this
research, Section 3 discusses the results of the experiment, and Section 4 concludes the paper and
gives an overview of future work.


2. RESEARCH METHOD 

Sentiment analysis research has been accelerated by the development of several lexical resources.
Many sentiment classification tasks utilize opinion words. Generally, positive opinion words
express desired states, while negative opinion words express undesired states. There are three main
approaches to collecting an opinion word list. The first is the manual approach, which is very time consuming
and is usually combined with the two automated approaches: the dictionary-based approach and the corpus-based approach.
Several lexicons and methods are commonly used for sentiment analysis, e.g.
OpinionFinder Lexicon [13], AFINN Lexicon [14], SentiWordNet Lexicon [15], SentiStrength Method [16],
Sentiment140 Method [17] and NRC Lexicon [18].


2.1. AFINN lexicon

Bradley and Lang developed the Affective Norms for English Words (ANEW) lexicon,
which became the basis for the AFINN lexicon [14]. ANEW gives emotional ratings for a large
number of English words. The ratings include “valence”, which rates the psychological reaction of a person
to a specific word. ANEW mostly covers formal language and does not include the slang words commonly used
in microblogging. To address this gap, [19] created the AFINN lexicon, which covers the language
of microblogging platforms, including slang, acronyms and web jargon. Positive
words are scored from 1 to 5 and negative words from -1 to -5, which makes the lexicon well suited to
strength estimation. It contains about 2477 words.

After considering the candidate lexicons, we decided to use the AFINN
lexicon because it is the most suitable for determining polarity in microblogging and social media [20].
We therefore applied the words in the AFINN lexicon to our data source, Twitter. However, the words
in the AFINN database are in English, whereas our study concerns the polarity of Malay words on Twitter.
To enable the use of AFINN for the Malay language, we translated each word in AFINN
from English to Malay and carried over its polarity value. The translated words were compared against several Malay
corpora hosted in the Malay Online Virtual Integrated Corpus (MOVIC) [21]. The translated Malay lexicon
contains the exact translations of the English words in AFINN and is known as the Malay Polarity Classification
Tool (MaCT). The result of the translation is shown in Figure 1, where the English words in AFINN are
translated to Malay. As Figure 1 shows, each word is given a polarity value, whether positive,
negative or neutral. For example, the word “meninggalkan” is given a value of -2, which means its
negative intensity is at Level 2, whereas “tidak hadir” is given a value of -1
and is therefore less intense than “meninggalkan”.

Sentiment classification starts by collecting Malay tweets from Twitter.
To collect the tweets, we used Twitter4j [22], a Java library for the Twitter API.
It provides many functions, such as filtering tweets by time or by hashtag. The polarity
classification process for the tweets is depicted in Figure 2. As the figure shows, when a
search word is chosen, the system extracts the matching data from Twitter, which is later analysed to determine
its polarity. The collected tweets generally contain noise that needs to be removed, so several text
pre-processing methods are applied [23]:
1. Tokenization
2. Removing numbers that do not express any emotions or attitudes
3. Removing punctuation
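
The three pre-processing steps listed above can be sketched in Python as follows. This is only a minimal illustration and not the tool's actual implementation, which the paper does not list; the function name `preprocess`, the whitespace tokenization and the lowercasing step are assumptions made for the example.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(tweet):
    """Return the cleaned tokens of a tweet (illustrative sketch)."""
    # Step 1: tokenization. Splitting on whitespace and lowercasing
    # are assumptions; the paper does not name a tokenizer.
    tokens = tweet.lower().split()

    cleaned = []
    for token in tokens:
        # Step 3: remove punctuation attached to the token.
        token = token.translate(_PUNCT_TABLE)
        # Step 2: drop pure numbers (and tokens emptied by step 3).
        if not token or token.isdigit():
            continue
        cleaned.append(token)
    return cleaned


print(preprocess("Kerja anda tidak bagus!!! 123"))
```

Punctuation is stripped before the number check so that a token such as "2019," is still recognised as a number and dropped.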


The next step is to extract the relevant tweets based on the search keywords, after which
the polarity is determined based on the AFINN lexicon. Lastly, the polarity values are calculated using the
polarity calculation algorithm.
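
The paper does not spell out the polarity calculation algorithm, so the sketch below shows one common lexicon-based scheme under stated assumptions: the tweet score is the sum of the lexicon values of its tokens, and the sign of the sum gives the class. The names `SAMPLE_LEXICON`, `tweet_score` and `polarity` are hypothetical, and the value for "bagus" is an assumed positive score; the negative values are the ones quoted in Section 2.1 and Figure 1.

```python
# Tiny excerpt of a Malay AFINN-style lexicon (word -> score in -5..5).
SAMPLE_LEXICON = {
    "meninggalkan": -2,  # value quoted in Section 2.1
    "jijik": -3,         # value legible in Figure 1
    "membenci": -3,      # value legible in Figure 1
    "bagus": 3,          # ASSUMED positive value, for illustration only
}


def tweet_score(tokens, lexicon):
    """Sum the lexicon scores of all tokens; unknown words count as 0."""
    return sum(lexicon.get(token, 0) for token in tokens)


def polarity(score):
    """Map a numeric score to a class label by its sign."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = ["kerja", "anda", "bagus"]
print(polarity(tweet_score(tokens, SAMPLE_LEXICON)))
```

Under this scheme the magnitude of the sum can also serve as a rough intensity indicator, in line with the strength dimension discussed in the introduction.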


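[Figure 1: excerpt of the translated lexicon as word-score pairs. Legible entries include “meninggalkan”, “terbengkalai”, “mengabaikan”, “diculik”, “penculikan”, “jijik”: -3, “membenci”: -3, “hina”, “kebolehan”, “penyalahgunaan”: -3 and “kesat”; the remaining scores are not recoverable from the source.]
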
Figure 1. AFINN’s Malay translation


[Figure 2: flow diagram with the stages Twitter API, Tweets, Feature Extraction, Polarity Determination using AFINN and Polarity Calculation Algorithm, ending in the output class (Positive/Negative/Neutral).]


Figure 2. Polarity Classification process 


2.2. Negation scoring strategy 

The negation scoring strategy is a very important feature of this sentiment analysis tool [24].
It helps the tool assign the correct polarity when there are negation words in the text.
Without it, a phrase such as “tidak bagus” would be detected as positive, because “bagus” is a
token rated with a positive value. The negation words in the Malay language are shown in Figure 3.
Generally, the orientation of a tweet's polarity depends on negation, since the meaning
changes completely when negation is involved [25]. For example, Tweet 1 is a positive tweet whose
polarity is determined by the expression "bagus":
1. Kerja anda bagus. (Positive Tweet)
Tweet 2 uses the negation word "tidak" and is the negation of Tweet 1:
2. Kerja anda tidak bagus. (Negative Tweet)
The process of filtering the negation words is illustrated in Figure 4. The process first identifies the
word that precedes the word token; if that word is a negation word, the token's value is multiplied by -1,
so that a positively rated token receives a negative score.
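
The negation check described above can be sketched as follows. This is an illustration under assumptions, not the tool's code: the negator set uses only the clearly legible entries of Figure 3, the function name `score_with_negation` is hypothetical, and the score of 3 for "bagus" is an assumed value.

```python
# Negation words taken from the legible entries of Figure 3.
NEGATORS = {"tidak", "tak", "bukan", "enggan", "taknak"}


def score_with_negation(tokens, lexicon, negators=NEGATORS):
    """Sum token scores, flipping a score when the previous word negates it."""
    total = 0
    for i, token in enumerate(tokens):
        value = lexicon.get(token, 0)
        # Look at the word immediately before the scored token.
        if value != 0 and i > 0 and tokens[i - 1] in negators:
            value *= -1  # "times token value with -1", as in Figure 4
        total += value
    return total


lexicon = {"bagus": 3}  # ASSUMED score, for illustration only
print(score_with_negation(["kerja", "anda", "bagus"], lexicon))
print(score_with_negation(["kerja", "anda", "tidak", "bagus"], lexicon))
```

With these inputs, Tweet 1 keeps its positive score while Tweet 2 receives the same magnitude with the opposite sign. Only the immediately preceding word is inspected, so a negator that sits further away from the token is not handled by this sketch.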


negators.json
{
  "tidak": 1,
  "tak": 1,
  "bukan": 1,
  "enggan": 1,
  "taknak": 1,
  "walaupun": 1
}

Figure 3. Malay negation words

[Figure 4: flowchart. Identify the word before the token; if it is a negation word, multiply the token value by -1; otherwise check the next token.]

Figure 4. Processes to filter the negation words


The words collected from Twitter were then tested in two steps: half of the data
was used as training data and the other half as test data. The equations for overall accuracy,
precision and recall are given below.


\[ \text{Overall accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]


where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives
and false negatives, respectively.
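
As a worked illustration of Equations (1) to (3), the short Python sketch below computes the three measures from the four counts. The function name `evaluate` and the counts in the example are hypothetical; they are not the counts from our experiment.

```python
def evaluate(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Guard against empty denominators by reporting 0.0 in that case.
    accuracy = (tp + tn) / total if total else 0.0    # Equation (1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # Equation (2)
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # Equation (3)
    return accuracy, precision, recall


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
accuracy, precision, recall = evaluate(tp=45, tn=45, fp=5, fn=5)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

With these counts, accuracy is 90/100 and both precision and recall are 45/50.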


3. RESULTS AND DISCUSSION 

In general, our contributions in this paper are twofold: 1) a new Malay version of AFINN,
together with the polarity values of its words (positive/negative/neutral), and 2) a polarity classification
technique that identifies positive, negative or neutral tweets using several feature extraction techniques.
Many researchers have used statistical and machine learning techniques to analyze collected data.
Iyer et al. discuss the use of a deep neural network to analyse data from a system that detects,
classifies and counts vehicles that pass a dedicated camera [26]. In a similar line, Al-Hagery
extracted patterns from a dates product dataset using a machine learning technique based on
association rule generation [27]. In addition, Sathyavikasini and Vijaya describe a technique
to distinguish types of disease, for which they trained classifiers using supervised pattern
learning [28].

Referring to Figure 1 in Section 2.1, the Malay version of AFINN
gives Malay sentiment analysis researchers the opportunity to study both the polarity and the strength
of words, adding further meaning to their work. Based on the literature, we first
experimented with three feature extraction methods based on the machine learning classifiers SVM, Naive
Bayes (NB) and K-nearest neighbour (KNN). Our polarity classification method was validated by
randomly collecting 400 positive tweets, 400 negative tweets and 200 neutral tweets. We then gathered 100
respondents to label the words as positive or negative, and their labels were compared with the output of
our classification method. The next step was to compute the accuracy, precision and recall of our classification
method using Equations (1), (2) and (3) in Section 2.2. The results of the evaluation are shown in Table 1:
our classification method successfully grouped the sentiment polarity of 88.06% of the tweets. For
the classified tweets, it achieved 90% precision, recall and accuracy.


Table 1. Evaluation for Precision, Recall and Accuracy for our classification method 


Percentage of words classified    Method evaluation
                                  Precision    Recall    Accuracy
88.06%                            90%          90%       90%


The results in Table 1 show that polarity values can be derived from the list of words
translated from the AFINN database. This addresses the difficulty of finding a tool that
supports sentiment analysis research in the Malay language.


4. CONCLUSION 

The development of sentiment lexicons for the English language has been very useful for researchers over
the years. A number of sentiment lexicons have been introduced, and one of them is AFINN. However, none
of these lexicons was developed for the Malay language. We therefore translated the words in AFINN
and, as a result, obtained a comprehensive sentiment lexicon for the Malay language. The main contribution of our
research is the development of the Malay Polarity Classification Tool (MaCT), which classifies sentiment
for the Malay language. We chose AFINN because it is the most suitable lexicon for social media data.
The data for our experiment came from Twitter, and we used our method to classify the tweets as positive,
negative or neutral. In validation, our method scored 90% for precision, recall
and accuracy. We met the objectives of our study, which were to find
the polarity of words on Twitter and to perform sentiment analysis based on that polarity. For future
work, we plan to develop a more complete lexicon for the Malay language so that we can obtain higher precision
and accuracy scores. Work on sentiment analysis in the Malay language is still developing, and we plan to
continue working towards a better sentiment lexicon tool.